Voice
Druid enables AI Agent voice capabilities to meet the demand for hands-free, conversational interactions across various business scenarios. This allows users to communicate with AI Agents naturally through two primary modes:
- Telephony. Users can interact with AI Agents via traditional phone lines. This is ideal for automating call center triage before agent hand-off, or providing automated HR and IT Help Desk support through a dedicated phone number and telephone exchange.
- Voice intranet page. Users can use voice commands directly within a web interface. For example, a user can verbally instruct an AI Agent to perform tasks or edit documents while working within an intranet page.
The voice channel is currently available as a technology preview via the Druid web snippet. You can configure and test voice conversations within the Druid Portal or on hosted web snippets.
How the Voice Channel works with the WebChat snippet
- Press the microphone button in the chat snippet to start talking with the AI Agent.
- Your voice is processed by the Speech-to-Text (STT) service as you speak, and the live transcript appears in the input area. When you finish speaking, the text is sent to the AI Agent.
- The AI Agent processes the text and responds with a text message. You will see the text response in the chat snippet, and the AI Agent will also speak the response to you.
- The spoken response is delivered by the Text-to-Speech (TTS) service.
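For illustration, the same press-to-talk loop can be sketched with the browser's built-in Web Speech API. This is not the Druid snippet's implementation; sendToAgent, showInInputArea, and micButton are hypothetical placeholders standing in for the snippet's internals.

```typescript
// Hypothetical placeholders for the snippet's internals:
declare function showInInputArea(text: string): void;
declare function sendToAgent(text: string): Promise<string>;
declare const micButton: HTMLButtonElement;

const SpeechRecognitionImpl =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;
const recognition = new SpeechRecognitionImpl();
recognition.interimResults = true; // stream a live transcript while the user speaks

recognition.onresult = async (event: any) => {
  const result = event.results[event.results.length - 1];
  const transcript: string = result[0].transcript;
  if (!result.isFinal) {
    showInInputArea(transcript); // live transcript in the input area
    return;
  }
  // The utterance is complete: send the final text to the AI Agent...
  const reply = await sendToAgent(transcript);
  // ...then speak the agent's text response back to the user (TTS).
  speechSynthesis.speak(new SpeechSynthesisUtterance(reply));
};

// Pressing the microphone button starts listening (STT).
micButton.addEventListener("click", () => recognition.start());
```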
Set up the speech provider
Druid delivers Speech-to-Text (STT) and Text-to-Speech (TTS) functionality through integrations with industry-leading Technology Partners. Out-of-the-box support includes:
- Microsoft Cognitive Services
- ElevenLabs (available starting with Druid 9.15)
- Deepgram (STT only)
To integrate a preferred speech provider not listed above, please contact Druid Tech Support.
Setting up the Microsoft Cognitive Service
- In the Druid Portal, go to your AI Agent settings.
- Select the AI & Cognitive Services category and click Microsoft Cognitive Service.
- Enter the Key and Region provided by the Druid Support Team in the voice channel activation email.
- Map the languages your AI Agent supports to specific voices in the configuration table.
- In the table below the Voice channel details, click the plus icon (+) to add a row.
- From the Language dropdown, select the AI Agent language (default or additional).
- From the Voice dropdown, select the specific voice the AI Agent will use to respond.
- Click the Save icon displayed inline.
- Click Save at the bottom of the page and close the modal.
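Under the hood, the Key/Region pair and each Language-to-Voice row map onto Microsoft Speech SDK concepts. The following is a minimal sketch using the official microsoft-cognitiveservices-speech-sdk package directly; the key, region, and voice name are placeholders, and Druid makes the equivalent calls for you once the service is configured.

```typescript
import * as sdk from "microsoft-cognitiveservices-speech-sdk";

const speechConfig = sdk.SpeechConfig.fromSubscription(
  "<key-from-activation-email>",   // the Key field in the Portal
  "<region-from-activation-email>" // the Region field in the Portal
);

// Equivalent of one Language -> Voice row in the mapping table:
speechConfig.speechSynthesisLanguage = "en-US";
speechConfig.speechSynthesisVoiceName = "en-US-JennyNeural";

const synthesizer = new sdk.SpeechSynthesizer(speechConfig);
synthesizer.speakTextAsync(
  "Hello! How can I help you today?",
  (result) => {
    console.log(`Synthesis result: ${sdk.ResultReason[result.reason]}`);
    synthesizer.close();
  },
  (err) => {
    console.error(err);
    synthesizer.close();
  }
);
```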
Setting up Deepgram
Prerequisites
- You need a Deepgram API Key with Member permissions. Refer to the Deepgram documentation (Token-Based Authentication) for information on how to create a key with Member permissions.
Setup procedure
- In the Druid Portal, go to your AI Agent settings.
- Select the AI & Cognitive Services category and click Deepgram.
- Enter your Deepgram API Key.
- Map the languages your AI Agent supports to specific Deepgram models in the configuration table.
- In the table, click the plus icon (+) to add a row.
- From the Language dropdown, select the AI Agent language (default or additional).
- From the Model dropdown, select the specific Deepgram model the AI Agent will use to transcribe the user's speech.
- Click the Save icon displayed inline.
- Click Save at the bottom of the page and close the modal.
- Click the WebChat channel, select Deepgram as the Speech-to-Text Provider and Azure as the Text-to-Speech Provider.
- (Optional) Select Azure as a Fallback Speech-to-Text Provider in Webchat. Azure is used when Deepgram does not support the chat user’s language.
- Click the Save button at the bottom of the page.
Deepgram provides both general-purpose and specialized models (for example, nova-2-medical). See the Deepgram documentation for the complete list of available models. Once setup is complete, the voice button appears in the chat snippet and users can click it to speak with the AI Agent.
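For reference, a minimal sketch of what one Language-to-Model row drives, assuming the official Deepgram JS SDK (npm: @deepgram/sdk, v3). The API key and audio URL are placeholders; Druid performs the equivalent transcription calls for you once the channel is configured.

```typescript
import { createClient } from "@deepgram/sdk";

async function transcribeOnce(): Promise<string> {
  const deepgram = createClient("<your-deepgram-api-key>"); // placeholder key

  // One Language -> Model row from the mapping table, expressed as options:
  const { result, error } = await deepgram.listen.prerecorded.transcribeUrl(
    { url: "https://example.com/utterance.wav" }, // placeholder audio source
    { model: "nova-2", language: "en" }
  );
  if (error || !result) throw error ?? new Error("no transcription result");

  return result.results.channels[0].alternatives[0].transcript;
}
```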
Setting up ElevenLabs
Druid supports ElevenLabs as a high-quality Text-to-Speech (TTS) provider, enabling your AI Agent to communicate using specialized synthetic voices and custom voice clones.
Prerequisites
- You need an ElevenLabs API Key. To get an API key, go to https://elevenlabs.io/app/developers/api-keys and copy the key.
Setup procedure
- In the Druid Portal, go to your AI Agent settings.
- Select the AI & Cognitive Services category and click ElevenLabs.
- Enter your ElevenLabs API Key.
- Map the languages your AI Agent supports to specific ElevenLabs voices in the configuration table.
- In the table, click the plus icon (+) to add a row.
- From the Language dropdown, select the AI Agent language (default or additional).
- From the Voice dropdown, select the specific ElevenLabs voice the AI Agent will use to respond. The model is automatically filled in after you select the voice.
- Click the Save icon displayed inline.
- Click Save at the bottom of the page and close the modal.
Configure the Voice channel
Once a speech provider is active, you must explicitly tell the WebChat channel to use these services:
- Select the Web & Email category and click the WebChat channel.
- Select the desired Speech-to-Text Provider. If you selected Deepgram, you should also select Azure as a Fallback Speech-to-Text Provider. Azure will be used automatically if Deepgram does not support the user's language.
- Select the desired Text-to-Speech Provider. If you selected ElevenLabs, you should also select Azure as a Fallback Text-to-Speech Provider. Azure will be used automatically if ElevenLabs does not support the user’s language.
- Click Save at the bottom of the page and close the modal.
A microphone icon will automatically appear in the webchat snippet. This allows users to switch from text to voice conversations seamlessly, enabling natural vocal interaction with the AI Agent.
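The fallback rule can be summarized as a small decision function. The sketch below illustrates the behavior only and is not Druid's code; all names are hypothetical. The same rule applies on the TTS side when ElevenLabs is the primary provider.

```typescript
type SttProvider = "Deepgram" | "Azure";

function pickSttProvider(
  userLanguage: string,
  deepgramLanguages: Set<string> // languages the primary provider supports
): SttProvider {
  // The primary provider handles the request when it supports the language...
  if (deepgramLanguages.has(userLanguage)) return "Deepgram";
  // ...otherwise the configured fallback (Azure) takes over automatically.
  return "Azure";
}

console.log(pickSttProvider("en", new Set(["en", "es"]))); // "Deepgram"
console.log(pickSttProvider("sw", new Set(["en", "es"]))); // "Azure"
```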
How the Voice Channel works with SDL Real-time Machine Translation
If you use a translation service for real-time translation and activate the Voice channel, the AI Agent will play back the response in the user's language.
When activating SDL machine translation, you can choose when the translation is performed: at conversation time or authoring time. For more information, see Using Machine Translation.
Voice Channel with Conversation Time Translation
- The user speaks in Language A.
- Speech-to-Text (STT) is performed in Language A.
- The text is translated into the AI Agent default language.
- NLP is performed in the AI Agent default language.
- A response is generated in the AI Agent default language.
- The response is translated back into Language A.
- The AI Agent responds with text in Language A.
- The response text is converted into audio by the Text-to-Speech (TTS) service.
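To make the order of operations explicit, the conversation-time pipeline can be written as a sequence of calls. All helper names below are hypothetical and exist only for this sketch; none of them come from the Druid API.

```typescript
// Hypothetical helpers, named after the pipeline stages above:
declare function speechToText(audio: ArrayBuffer, lang: string): Promise<string>;
declare function translate(text: string, from: string, to: string): Promise<string>;
declare function runNlpAndRespond(text: string): Promise<string>;
declare function textToSpeech(text: string, lang: string): Promise<ArrayBuffer>;

async function handleUtterance(
  audio: ArrayBuffer,
  userLang: string,    // Language A
  defaultLang: string  // the AI Agent default language
): Promise<ArrayBuffer> {
  const userText = await speechToText(audio, userLang);                 // STT in Language A
  const agentInput = await translate(userText, userLang, defaultLang);  // A -> default
  const agentReply = await runNlpAndRespond(agentInput);                // NLP + response
  const userReply = await translate(agentReply, defaultLang, userLang); // default -> A
  return textToSpeech(userReply, userLang);                             // TTS in Language A
}
```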
Voice Channel with Authoring Time Translation
When using Authoring Time Translation, Druid translates the message written in the Voice setting of flow steps from the default AI Agent language to all additional languages.
- The user speaks in Language A (default or additional AI Agent language).
- Speech-to-Text (STT) is performed in Language A.
- NLP is performed in Language A.
- The AI Agent responds with text in Language A.
- The response text is converted into audio by the Text-to-Speech (TTS) service and spoken to the user.
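By contrast, the authoring-time pipeline has no runtime translation step, because the flow content was already translated when it was authored. Same hypothetical helper names as in the previous sketch.

```typescript
// Hypothetical helpers, as in the conversation-time sketch:
declare function speechToText(audio: ArrayBuffer, lang: string): Promise<string>;
declare function runNlpAndRespond(text: string): Promise<string>;
declare function textToSpeech(text: string, lang: string): Promise<ArrayBuffer>;

async function handleUtteranceAuthoringTime(
  audio: ArrayBuffer,
  userLang: string // Language A (default or additional AI Agent language)
): Promise<ArrayBuffer> {
  const userText = await speechToText(audio, userLang); // STT in Language A
  const agentReply = await runNlpAndRespond(userText);  // NLP directly in Language A
  return textToSpeech(agentReply, userLang);            // TTS in Language A
}
```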